Resilience

Resilience. It’s a buzzword thrown around a lot these days, particularly when discussing systems architecture. We talk about building resilient systems – fault tolerance, redundancy, graceful degradation. But what about building resilient engineering teams? And more importantly, resilient engineering leaders?

I recently witnessed a team paralyzed by a critical production issue. Not because the problem was technically insurmountable, but because the single engineer who understood the affected service was unavailable. This underscored a painful truth: technical resilience is only half the battle. The ability to bounce back from failures, adapt to change, and maintain performance under pressure isn't just about clever code; it’s fundamentally about the people building that code, and the leadership that supports them.

The Fragility of "Normal"

We operate under the illusion of predictable environments. We plan sprints, forecast timelines, and assume a certain level of stability. But the last few years have demonstrated just how fragile that "normal" can be. From a global pandemic to rapid economic shifts to the volatility of emerging technologies, disruption is constant.

This isn't just about reacting to crises; it’s about proactively building systems—both technical and human—that expect disruption. Think about it: we've become so focused on optimization – squeezing every ounce of performance and efficiency – that we've often stripped away the buffers that absorb unexpected shocks. This applies to our codebases and our teams.

What Does Engineering Resilience Look Like?

Resilient engineering isn't about preventing all failures (that’s unrealistic). It’s about how you respond when things inevitably go wrong. I see it manifest in a few key areas:

  • Psychological Safety: This is foundational. Teams need to feel safe admitting mistakes, challenging assumptions, and proposing unconventional solutions without fear of retribution. A blameless postmortem isn’t just a process; it’s a demonstration of trust. In my experience, teams with high psychological safety recover from incidents far faster than those operating in a culture of fear. Google’s Project Aristotle research identified psychological safety as the most important factor distinguishing its most effective teams.
  • Distributed Ownership: Avoid single points of failure, not just in architecture, but in knowledge and responsibility. A team where only one person understands a critical service is a fragile team. Encourage cross-training, documentation, and shared responsibility. This also means empowering engineers to make decisions – within defined boundaries, of course.
  • Embracing Iteration & Experimentation: Resilient teams aren’t afraid to try new things and, yes, sometimes fail. In fact, failure is expected and treated as a learning opportunity. This requires a mindset shift away from perfectionism and towards continuous improvement. Encourage small, frequent experiments to validate assumptions and mitigate risks.
  • Adaptability & Learning: The pace of technological change is relentless. Resilient engineers are lifelong learners, constantly seeking new knowledge and skills. As leaders, we must foster a culture of learning and provide opportunities for growth. This could include dedicated learning time, conference attendance, or internal knowledge sharing sessions.
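The point about distributed ownership can be made concrete with a quick audit of who knows what. The snippet below is only a rough sketch, assuming you keep (or can assemble) a simple mapping of services to the engineers who could handle an incident on them. The function name `find_knowledge_silos`, the `min_owners` threshold, and the sample data are all hypothetical, made up for illustration.

```python
def find_knowledge_silos(ownership, min_owners=2):
    """Return services that fewer than `min_owners` distinct engineers understand.

    `ownership` maps a service name to the list of engineers who could
    debug it during an incident (a hypothetical, hand-maintained mapping).
    """
    return {
        service: sorted(set(engineers))
        for service, engineers in ownership.items()
        if len(set(engineers)) < min_owners
    }


# Hypothetical team data, for illustration only.
ownership = {
    "billing": ["ana"],
    "auth": ["ana", "raj"],
    "search": ["lee", "raj", "sam"],
}

# Services listed here are candidates for cross-training or documentation.
print(find_knowledge_silos(ownership))
```

Any service this flags is a knowledge single point of failure. The list is a starting point for a conversation about cross-training, not a metric to optimize.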

Leading with Resilience: It Starts with You

Building a resilient team starts with you, the engineering leader. Here are a few things I’ve found effective:

  • Model Vulnerability: Share your own mistakes and lessons learned. It creates space for others to do the same. If you always project an image of flawless competence, you’ll stifle the very behaviors you’re trying to encourage. It can be challenging, especially for those in leadership positions, but starting small with trusted team members can be a good approach.
  • Practice Active Listening: Truly listen to your team’s concerns and perspectives. Understanding their challenges is crucial for building trust and identifying potential vulnerabilities.
  • Prioritize Value Delivery: Perfection is often the enemy of progress. In many cases, a “good enough” solution delivered quickly is preferable to a perfect solution delivered too late. This requires making thoughtful trade-offs and prioritizing based on risk and impact.
  • Prioritize Wellbeing: Burnout is a significant threat to engineering resilience. Engineering leaders face specific pressures, including unrealistic deadlines, constant context switching, and the complexities of managing remote teams. Encourage healthy work-life balance, promote self-care, and be mindful of workload distribution.

Beyond Technical Debt: Accumulating Resilience Debt

Just like technical debt, we can accumulate “resilience debt” by consistently prioritizing short-term gains over long-term stability. Ignoring psychological safety, neglecting knowledge sharing, or pushing teams to their breaking points creates vulnerabilities that surface during times of stress. Consider resilience debt not as a separate issue, but as a lens through which to view all of the practices above: a reminder that neglecting the human aspects of our teams always comes at a cost.

As leaders, we need to proactively address this resilience debt by investing in the human aspects of our teams. This isn’t a “nice to have” – it’s a critical investment in our long-term success.

Building resilient engineering teams isn’t about eliminating all risk; it’s about preparing for the inevitable disruptions and empowering your team to navigate them effectively. It’s about recognizing that the strength of your system lies not just in the code, but in the people who build and maintain it. And it starts with leading with vulnerability, empathy, and a commitment to building a culture of continuous learning and adaptation.

Key Takeaways:

  • Psychological Safety is Foundational: Create an environment where team members feel safe admitting mistakes and challenging assumptions.
  • Distribute Ownership: Avoid single points of failure in both technical systems and knowledge.
  • Embrace Experimentation: Encourage learning through small, frequent experiments.
  • Prioritize Wellbeing: Recognize and address the factors that contribute to burnout.
  • Invest in Resilience: Proactively address “resilience debt” by prioritizing the human aspects of your team.